Trees





Kerry Back

Decision Trees

  • Split data sequentially into subsets based on the value of a single feature
    • Above a threshold into one group
    • Below the threshold into the other
  • Prediction in each subset is the plurality class (for classification) or the cell mean (for regression).
  • Each split is chosen to minimize impurity (for classification) or, usually, mean squared error (for regression).
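To make the splitting rule concrete, here is a minimal sketch of how a single regression split could be chosen on one feature. The data and the helper names (sse, best_split) are made up for illustration. Minimizing the summed squared error of the two groups is equivalent to minimizing their sample-weighted mean squared error.

```python
# Toy data (made-up numbers): one feature x and a target y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 3.0, 3.2, 2.8]

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Try the midpoint between each pair of adjacent feature values.

    Returns (threshold, total_sse) for the threshold that minimizes the
    summed squared error of the two resulting groups.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical feature values cannot be separated
        threshold = (lo + hi) / 2
        left = [yy for xx, yy in pairs if xx <= threshold]
        right = [yy for xx, yy in pairs if xx > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best

threshold, total_sse = best_split(x, y)

# The prediction in each group is its cell mean.
left_y = [yy for xx, yy in zip(x, y) if xx <= threshold]
right_y = [yy for xx, yy in zip(x, y) if xx > threshold]
left_mean = sum(left_y) / len(left_y)
right_mean = sum(right_y) / len(right_y)
print(threshold, left_mean, right_mean)
```

A full tree repeats this search over every feature and then recurses into each group until a stopping rule such as a maximum depth is reached.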

Example

Another example

Example: train from 2021-12, predict for 2022-01

  • Get data from the SQL database as before
import pandas as pd

# conn is the database connection opened earlier
df = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date in ('2021-12', '2022-01')
    """,
    conn
)

Transform each cross-section

from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution="normal")

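def qtxs(d):
    # Map each column of one cross-section to a standard normal distribution
    x = qt.fit_transform(d)
    return pd.DataFrame(x, columns=d.columns, index=d.index)

# Refit the transformer separately for each date, so the features and the
# return are ranked only against other stocks in the same month
df[["roeq", "mom12m", "ret"]] = df.groupby(
  "date",
  group_keys=False
)[["roeq", "mom12m", "ret"]].apply(qtxs)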
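As a rough illustration of what the cross-sectional transform does, the sketch below applies the same pattern to synthetic data. The toy frame, its single column, and the choice n_quantiles=500 (set to match the 500 rows per date) are assumptions for the example, not part of the course data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in: two dates with skewed values at very different levels.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": ["2021-12"] * 500 + ["2022-01"] * 500,
    "ret": np.concatenate([
        rng.lognormal(size=500),
        5 + rng.lognormal(size=500),
    ]),
})

qt = QuantileTransformer(output_distribution="normal", n_quantiles=500)

def qtxs(d):
    # Fit and transform one cross-section at a time
    x = qt.fit_transform(d)
    return pd.DataFrame(x, columns=d.columns, index=d.index)

toy[["ret"]] = toy.groupby("date", group_keys=False)[["ret"]].apply(qtxs)

# After the transform, each date should be centered near zero with a
# standard deviation near one, whatever its original level or skew.
summary = toy.groupby("date")["ret"].agg(["median", "std"])
print(summary)
```

Because the transformer is refit within each date, the level shift between the two months disappears: only a stock's rank within its own cross-section survives.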

Fit a regression tree

from sklearn.tree import DecisionTreeRegressor

# Train on the 2021-12 cross-section only
Xtrain = df[df.date=='2021-12'][["roeq", "mom12m"]]
ytrain = df[df.date=='2021-12']["ret"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(Xtrain, ytrain)
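The prediction step for 2022-01 might look like the sketch below. It runs on synthetic data so that it is self-contained: the demo frame, the made-up return rule, and the names train, test, and pred are assumptions for illustration, while the model settings match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the two cross-sections (not the course data).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "date": ["2021-12"] * n + ["2022-01"] * n,
    "roeq": rng.standard_normal(2 * n),
    "mom12m": rng.standard_normal(2 * n),
})
# Made-up rule: returns are higher when roeq is positive, plus noise.
demo["ret"] = 0.5 * (demo["roeq"] > 0) + 0.1 * rng.standard_normal(2 * n)

features = ["roeq", "mom12m"]
train = demo[demo.date == "2021-12"]
test = demo[demo.date == "2022-01"]

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(train[features], train["ret"])

# Each 2022-01 stock gets the mean training return of the leaf it falls in.
pred = pd.Series(
    model.predict(test[features]),
    index=test.index,
    name="predicted_ret",
)
n_leaves = model.get_n_leaves()
print(pred.value_counts())
```

With max_depth=2 the tree has at most four leaves, so the predictions take at most four distinct values.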

View the regression tree

from sklearn.tree import plot_tree
# Label the splits with column names instead of x[0], x[1]
_ = plot_tree(model, feature_names=list(Xtrain.columns))
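When a plot is inconvenient, sklearn's export_text gives a text rendering of the same splits. The sketch below uses a tiny made-up dataset in which only roeq separates the target, so the root split is predictable. The numbers are illustrative only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Tiny made-up dataset: the target is -1 when roeq < 0 and +1 otherwise.
X = pd.DataFrame({
    "roeq": [-2.0, -1.0, 1.0, 2.0],
    "mom12m": [0.3, -0.4, 0.1, -0.2],
})
y = [-1.0, -1.0, 1.0, 1.0]

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# One line per node: the split rule for internal nodes and the
# predicted value for leaves.
text = export_text(model, feature_names=list(X.columns))
print(text)
```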

Feature importance

  • What fraction of the total reduction in squared error comes from splits on each feature?
pd.Series(model.feature_importances_, index=Xtrain.columns)
roeq      0.713744
mom12m    0.286256
dtype: float64
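To see where these numbers come from, the sketch below recomputes the importances by hand from the fitted tree's node arrays: each split's importance is the sample-weighted drop in impurity it achieves, summed by feature and normalized to add up to one. It uses synthetic data, and the names gain and manual are made up for illustration. This is my reading of how sklearn arrives at feature_importances_, and the final comparison checks it on this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the first feature matters more than the second.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
y = X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.standard_normal(300)

model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
t = model.tree_

gain = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:
        continue  # leaf node: no split, so no contribution
    # Weighted impurity of the parent minus that of its two children
    gain[t.feature[node]] += (
        t.weighted_n_node_samples[node] * t.impurity[node]
        - t.weighted_n_node_samples[left] * t.impurity[left]
        - t.weighted_n_node_samples[right] * t.impurity[right]
    )

manual = gain / gain.sum()
print(manual, model.feature_importances_)
```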